Why visualize?
Terminology
Question: What would be an effective way to visualize:
There are many Python packages for visualization.
We'll begin with visualization in pandas and focus on matplotlib. There is great documentation on all of this. The case study is to analyze the flow of bicycles out of stations in the Pronto trip data. In this section, we'll discuss:
In [1]:
import pandas as pd
import matplotlib.pyplot as plt
# The following ensures that the plots are in the notebook
%matplotlib inline
# We'll also use capabilities in numpy
import numpy as np
Analysis questions
In [2]:
df = pd.read_csv("2015_trip_data.csv")
df.head()
Out[2]:
Suppose we want to analyze the flow of bicycles from and to stations.
Question: What data do we need for this visualization? How do we get it?
In [3]:
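# Series.value_counts() gives the same result as pd.value_counts(),
# which newer pandas releases deprecate
from_counts = df.from_station_id.value_counts()
to_counts = df.to_station_id.value_counts()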
In [4]:
from_counts.head()
Out[4]:
In [5]:
type(from_counts)
Out[5]:
In [6]:
to_counts.head()
Out[6]:
Question: How would we get the same information using groupby?
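One possible answer to the groupby question, sketched on a made-up miniature of the trip table (the station IDs "A", "B", "C" are hypothetical, not Pronto stations). Grouping by the station column and taking the size of each group should give the same counts as value_counts; only the ordering differs.

```python
import pandas as pd

# Hypothetical miniature of the trip table; station IDs are made up
trips = pd.DataFrame({
    "from_station_id": ["A", "B", "A", "C", "A", "B"],
    "to_station_id":   ["B", "A", "C", "A", "B", "B"],
})

# value_counts orders the result by count, largest first
by_value_counts = trips["from_station_id"].value_counts()

# groupby(...).size() counts rows per group, ordered by the group key
by_groupby = trips.groupby("from_station_id").size()

# Compare label by label so that ordering does not matter
same_counts = by_value_counts.to_dict() == by_groupby.to_dict()
print(same_counts)
```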
Let's address the question "Which stations have the biggest difference between the in-flow and out-flow of bicycles?"
What kind of objects are returned from pd.value_counts? Are these plottable? How do we figure this out?
In [12]:
from_counts.plot.bar()
Out[12]:
We can compare from and to counts with side-by-side plots. But to do this, we need a DataFrame with these counts.
In [13]:
df_counts = pd.DataFrame({'from':from_counts, 'to': to_counts})
In [16]:
df_counts.plot(kind='bar', subplots=True, grid=True, title="Counts",
layout=(1,2), sharex=True, sharey=False, legend=False, figsize=(12, 8))
Out[16]:
Question: How do we make the plots bigger?
But this plot doesn't tell us about the difference between "from" and "to" counts. We want to subtract to_counts from from_counts. Will this difference be plottable?
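A small sketch of what the subtraction does, using made-up counts rather than the Pronto data. The result of subtracting two Series is itself a Series, so it has the same .plot methods. One thing to watch: rows are matched by index label, and a station missing from one side comes out as NaN unless we ask for a fill value.

```python
import pandas as pd

# Hypothetical counts; station "C" has departures but no arrivals
from_counts = pd.Series({"A": 5, "B": 3, "C": 2})
to_counts = pd.Series({"B": 4, "A": 1})

# Subtraction matches rows by index label, not by position
diff = from_counts - to_counts            # "C" becomes NaN

# Treat a station missing from one side as having zero trips
diff_filled = from_counts.sub(to_counts, fill_value=0)

print(type(diff))
print(diff_filled)
```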
In [17]:
# What is the index for df_counts?
df_counts.head()
Out[17]:
In [ ]:
(from_counts-to_counts).plot.bar()
Question: How do we get rid of the garbage data for the station "Pronto shop"?
In [22]:
df1 = df_counts[df_counts.index=='Pronto shop']
df1
Out[22]:
In [24]:
df_counts[df_counts.index!='Pronto shop'].plot.bar(figsize=(10,6))
Out[24]:
Some issues:
We want to get rid of the row 'Pronto shop' in both from_counts and to_counts.
In [ ]:
# Selecting a row
from_counts[from_counts.index == 'Pronto shop']
In [ ]:
# Deleting a row
new_from_counts = from_counts[from_counts.index != 'Pronto shop']
new_from_counts.plot.bar()
In [ ]:
def simple_clean_rows(df):
    """
    Removes the 'Pronto shop' row from df
    :param pd.DataFrame or pd.Series df:
    :return pd.DataFrame or pd.Series:
    """
    # The label must match the index exactly, including case
    df = df[df.index != 'Pronto shop']
    return df
In [ ]:
def clean_rows(df, indexes):
    """
    Removes from df all rows with the specified indexes
    :param pd.DataFrame or pd.Series df:
    :param list-of-str indexes:
    :return pd.DataFrame or pd.Series:
    """
    for idx in indexes:
        df = df[df.index != idx]
    return df
In [ ]:
dff = clean_rows(to_counts, ['Pronto shop', 'CBD-13'])
dff.plot.bar()
Does clean_rows need to return df to effect the change in df?
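A sketch that tries it both ways on made-up counts. The variant without a return statement (clean_rows_no_return, a hypothetical name) only rebinds its local variable, because boolean filtering builds a new object instead of changing the one passed in. So the caller sees no change unless the result is returned and reassigned.

```python
import pandas as pd

def clean_rows_no_return(s, indexes):
    # Hypothetical variant: rebinding the local name `s` leaves the caller's object alone
    for idx in indexes:
        s = s[s.index != idx]

def clean_rows(s, indexes):
    for idx in indexes:
        s = s[s.index != idx]
    return s

counts = pd.Series({"A": 5, "Pronto shop": 1, "B": 3})  # made-up counts

clean_rows_no_return(counts, ["Pronto shop"])
len_after_no_return = len(counts)   # 3: the caller's Series is untouched

counts = clean_rows(counts, ["Pronto shop"])
len_after_return = len(counts)      # 2: the filtered result replaced it
```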
In [ ]:
to_counts = clean_rows(to_counts, ['Pronto shop'])
to_counts.plot.bar()
In [ ]:
from_counts = clean_rows(from_counts, ['Pronto shop'])
from_counts.plot.bar()
In [ ]:
to_counts.head()
Let's take a more detailed approach to plotting so we can better control what gets rendered.
In this section, we show how to control various elements of plots to produce a desired visualization. We'll use matplotlib, a Python package modeled on MATLAB-style plotting.
Make a dataframe out of the count data.
In [ ]:
df_counts = pd.DataFrame({'From': from_counts.sort_index(), 'To': to_counts.sort_index()})
We need to align the counts by station. Does the DataFrame constructor do this for us?
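On the alignment question: as far as we know, building a DataFrame from a dict of Series matches rows by index label, not by position. The sketch below uses made-up counts listed in different station orders, with one station missing from the second Series, to check that each value lands in the right row.

```python
import pandas as pd

# Made-up counts, deliberately listed in different station orders
from_counts = pd.Series({"B": 3, "A": 5, "C": 2})
to_counts = pd.Series({"C": 7, "A": 1})

# Each Series becomes a column; rows are matched on the index labels
aligned = pd.DataFrame({"From": from_counts, "To": to_counts})
print(aligned)

# Station "B" has no "To" count, so that cell is NaN
print(aligned.loc["B", "To"])
```

If that holds on the real data, the sort_index() calls affect only the row order, not which counts end up side by side.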
In [ ]:
df_counts.head()
In [ ]:
"""
Basic bar chart using matplotlib
"""
n_groups = len(df_counts.index)
index = np.arange(n_groups) # The "raw" x-axis of the bar plot
fig = plt.figure(figsize=(12, 8)) # Controls global properties of the bar plot
rects1 = plt.bar(index, df_counts.From)
plt.xlabel('Station')
plt.ylabel('Counts')
plt.xticks(index, df_counts.index) # Convert "raw" x-axis into labels
_, labels = plt.xticks() # Get the new labels of the plot
plt.setp(labels, rotation=90) # Rotate labels to make them readable
plt.title('Station Counts')
plt.show()
Issue - much more code, which will tend to be copied and pasted.
Solution - MAKE A FUNCTION NOW!!!
In [ ]:
def plot_bar1(df, column, opts):
    """
    Does a bar plot for a single column.
    :param pd.DataFrame df:
    :param str column: name of the column to plot
    :param dict opts: key is plot attribute
    """
    n_groups = len(df.index)
    index = np.arange(n_groups)  # The "raw" x-axis of the bar plot
    rects1 = plt.bar(index, df[column])
    if 'xlabel' in opts:
        plt.xlabel(opts['xlabel'])
    if 'ylabel' in opts:
        plt.ylabel(opts['ylabel'])
    if 'xticks' in opts and opts['xticks']:
        plt.xticks(index, df.index)  # Convert "raw" x-axis into labels
        _, labels = plt.xticks()  # Get the new labels of the plot
        plt.setp(labels, rotation=90)  # Rotate labels to make them readable
    else:
        labels = ['' for x in df.index]
        plt.xticks(index, labels)
    if 'ylim' in opts:
        plt.ylim(opts['ylim'])
    if 'title' in opts:
        plt.title(opts['title'])
In [ ]:
fig = plt.figure(figsize=(12, 8)) # Controls global properties of the bar plot
opts = {'xlabel': 'Stations', 'ylabel': 'Counts', 'xticks': True, 'title': 'A Title'}
plot_bar1(df_counts, 'To', opts)
We want to encapsulate the plotting of N variables into a function. We could rewrite plot_bar1, but other plots use it. Besides, plot_bar1 already handles a single plot well. So instead we call plot_bar1 from a new function.
In [ ]:
def plot_barN(df, columns, opts):
    """
    Does a bar plot for each column, stacked vertically as subplots.
    :param pd.DataFrame df:
    :param list-of-str columns: names of the columns to plot
    :param dict opts: key is plot attribute
    """
    num_columns = len(columns)
    local_opts = dict(opts)  # Shallow copy, so the caller's opts is not modified
    idx = 0
    for column in columns:
        idx += 1
        # Only the bottom subplot gets tick labels and an x-axis label
        local_opts['xticks'] = False
        local_opts['xlabel'] = ''
        if idx == num_columns:
            local_opts['xticks'] = True
            local_opts['xlabel'] = opts.get('xlabel', '')
        plt.subplot(num_columns, 1, idx)
        plot_bar1(df, column, local_opts)
In [ ]:
fig = plt.figure(figsize=(12, 8)) # Controls global properties of the bar plot
opts = {'xlabel': 'Stations', 'ylabel': 'Counts', 'ylim': [0, 8000]}
plot_barN(df_counts, ['To', 'From'], opts)
Question: How would we write tests for plot_barN?
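One possible approach to testing plotting code, sketched under some assumptions: instead of looking at the picture, we inspect the figure object that matplotlib builds. To keep the example self-contained, stacked_bars below is a hypothetical, simplified stand-in for plot_barN, and the data is made up. The same checks could be pointed at plot_barN itself.

```python
import matplotlib
matplotlib.use("Agg")  # Non-interactive backend, so no display is needed
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd


def stacked_bars(df, columns, ylim):
    """Hypothetical stand-in for plot_barN: one bar subplot per column."""
    positions = np.arange(len(df.index))
    for row, column in enumerate(columns, start=1):
        plt.subplot(len(columns), 1, row)
        plt.bar(positions, df[column])
        plt.ylim(ylim)
        plt.ylabel("Counts")


def test_stacked_bars():
    df = pd.DataFrame({"From": [5, 3, 2], "To": [1, 4, 7]},
                      index=["A", "B", "C"])
    fig = plt.figure()
    stacked_bars(df, ["To", "From"], ylim=[0, 10])
    axes = fig.axes
    assert len(axes) == 2                     # One subplot per column
    for ax in axes:
        assert len(ax.patches) == len(df)     # One bar per station
        assert ax.get_ylim() == (0.0, 10.0)   # The y-range was applied
        assert ax.get_ylabel() == "Counts"
    # Bar heights carry the data: the first subplot shows the "To" column
    heights = [patch.get_height() for patch in axes[0].patches]
    assert heights == [1, 4, 7]
    plt.close(fig)


test_stacked_bars()
```

Checks like these catch wrong columns, missing subplots, and forgotten options, but they do not tell us whether the plot looks good.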
Exercise
To make decisions about the truck trips required to adjust bikes at stations, we need to know the variations by day.
We want a bar plot of the average daily "to" and "from" counts with their standard deviations.
Need to:
(Assumes that a station has at least one rental every day.)
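To see why that assumption matters, here is a sketch on made-up trips where station "A" has no departures on day 2. Averaging only over the days that appear in the data overstates the daily mean. One possible fix, shown here as an option rather than the approach this notebook takes, is to spread the counts into a station-by-day table with zeros for the missing days.

```python
import pandas as pd

# Made-up trips: station "A" has no departures on day 2
trips = pd.DataFrame({
    "from_station_id": ["A", "A", "B", "B", "A", "A", "A", "A", "B"],
    "startday":        [1,   1,   1,   2,   3,   3,   3,   3,   3],
})

per_day = trips.groupby(["from_station_id", "startday"]).size()

# Mean over only the days on which the station had at least one trip
mean_observed = per_day.groupby(level=0).mean()

# Station-by-day table, with 0 where a station had no trips that day
per_day_full = per_day.unstack(fill_value=0)
mean_all_days = per_day_full.mean(axis=1)

print(mean_observed["A"], mean_all_days["A"])
```

Note that the table only has columns for days on which at least one station had a trip; a day with no trips anywhere would still be missing.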
In [ ]:
df.head()
Let's start with the values for starttime. What type are these?
In [ ]:
print(df.starttime[0])
print(type(df.starttime[0]))
Question: How do we extract the day from a string?
YOU DON'T!!! You convert it to a datetime object.
In [ ]:
this_datetime = pd.to_datetime(df.starttime[0])
print(this_datetime)
In [ ]:
this_datetime.dayofyear
In [ ]:
start_day = []
for time in df.starttime:
    start_day.append(pd.to_datetime(time).dayofyear)
In [ ]:
start_day[2]
In [ ]:
start_day = [pd.to_datetime(time).dayofyear for time in df.starttime]
stop_day = [pd.to_datetime(x).dayofyear for x in df.stoptime]
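The list comprehensions above call pd.to_datetime once per row. A possible alternative is to convert the whole column in one call and read the day through the .dt accessor. The timestamps below are made up, and the month/day/year layout passed as format is an assumption about how the file stores times.

```python
import pandas as pd

# Made-up timestamps; the month/day/year layout is an assumption about the file
starttime = pd.Series(["01/01/2015 00:05",
                       "02/01/2015 09:30",
                       "12/31/2015 23:50"])

# Convert the whole column at once, then read fields through the .dt accessor
parsed = pd.to_datetime(starttime, format="%m/%d/%Y %H:%M")
startday = parsed.dt.dayofyear

print(startday.tolist())   # [1, 32, 365]
```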
In [ ]:
df['startday'] = start_day # Creates a new column named 'startday'
df['stopday'] = stop_day
In [ ]:
df.head()
In [ ]:
groupby_day_from = df.groupby(['from_station_id', 'startday']).size()
groupby_day_from.head()
In [ ]:
groupby_day_to = df.groupby(['to_station_id', 'stopday']).size()
groupby_day_to.head()
Now we need to compute the average value and its standard deviation across the days for each station. The groupby produced a MultiIndex. So, further operations on the result must take this into account.
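A minimal sketch of working with the MultiIndex, on made-up trips for two hypothetical stations. Grouping on level 0 (the station) collapses the day level, so each station ends up with one mean and one standard deviation of its daily counts.

```python
import pandas as pd

# Made-up trips: station "A" has 2, 4, 6 trips on days 1-3; "B" has 1, 3 on days 1-2
trips = pd.DataFrame({
    "from_station_id": ["A"] * 12 + ["B"] * 4,
    "startday": [1] * 2 + [2] * 4 + [3] * 6 + [1] * 1 + [2] * 3,
})

# Index is a MultiIndex of (station, day); values are trips per station per day
per_day = trips.groupby(["from_station_id", "startday"]).size()

# level=0 groups by the first index level (the station), collapsing the days
stats = per_day.groupby(level=0).agg(["mean", "std"])
print(stats)
```

The std here is the sample standard deviation (pandas divides by n - 1 by default).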
In [ ]:
h_index = groupby_day_from.index
h_index.levshape # Size of the components of the MultiIndex
In [ ]:
from_means = groupby_day_from.groupby(level=[0]).mean() # Computes the mean of counts by day
from_stds = groupby_day_from.groupby(level=[0]).std() # Computes the standard deviation
In [ ]:
groupby_day_to = df.groupby(['to_station_id', 'stopday']).size()  # Arrivals are counted by the day the trip stopped
to_means = groupby_day_to.groupby(level=[0]).mean() # Computes the mean of counts by day
to_stds = groupby_day_to.groupby(level=[0]).std() # Computes the standard deviation
In [ ]:
df_day_counts = pd.DataFrame({'from_mean': from_means, 'from_std': from_stds, 'to_mean': to_means, 'to_std': to_stds})
df_day_counts.head()
In [ ]:
"""
Plotting two variables as a bar chart with error bars
"""
n_groups = len(df_day_counts.index)
index = np.arange(n_groups) # The "raw" x-axis of the bar plot
fig = plt.figure(figsize=(12, 8)) # Controls global properties of the bar plot
bar_width = 0.35 # Width of the bars
opacity = 0.6 # How transparent the bars are
#VVVV Changed to do two plots with error bars
error_config = {'ecolor': '0.3'}
rects1 = plt.bar(index, df_day_counts.from_mean, bar_width,
alpha=opacity,
color='b',
yerr=df_day_counts.from_std,
error_kw=error_config,
label='From')
rects2 = plt.bar(index + bar_width, df_day_counts.to_mean, bar_width,
alpha=opacity,
color='r',
yerr=df_day_counts.to_std,
error_kw=error_config,
label='To')
#^^^^ Changed to do two plots with error bars
plt.xticks(index + bar_width / 2, df_day_counts.index)  # Labels must come from the DataFrame being plotted
_, labels = plt.xticks() # Get the new labels of the plot
plt.setp(labels, rotation=90) # Rotate labels to make them readable
plt.legend()
plt.xlabel('Station')
plt.ylabel('Counts')
plt.title('Station Counts')
plt.show()